AITopics | Samangan Province

Collaborating Authors

Samangan Province

IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages

Khan, Mohammed Safi Ur Rahman, Mehta, Priyam, Sankar, Ananth, Kumaravelan, Umashankar, Doddapaneni, Sumanth, G, Suriyaprasaad, G, Varun Balan, Jain, Sparsh, Kunchukuttan, Anoop, Kumar, Pratyush, Dabre, Raj, Khapra, Mitesh M.

arXiv.org Artificial IntelligenceMar-10-2024

Despite the considerable advancements in English LLMs, the progress in building comparable models for other languages has been hindered due to the scarcity of tailored resources. Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages, containing a total of 251B tokens and 74.8M instruction-response pairs. Recognizing the importance of both data quality and quantity, our approach combines highly curated manually verified data, unverified yet valuable data, and synthetic data. We build a clean, open-source pipeline for curating pre-training data from diverse sources, including websites, PDFs, and videos, incorporating best practices for crawling, cleaning, flagging, and deduplication. For instruction-fine tuning, we amalgamate existing Indic datasets, translate/transliterate English datasets into Indian languages, and utilize LLaMa2 and Mixtral models to create conversations grounded in articles from Indian Wikipedia and Wikihow. Additionally, we address toxicity alignment by generating toxic prompts for multiple scenarios and then generate non-toxic responses by feeding these toxic prompts to an aligned LLaMa2 model. We hope that the datasets, tools, and resources released as a part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages. The data and other artifacts created as part of this work are released with permissive licenses.

computational linguistic, dataset, indian language, (15 more...)

arXiv.org Artificial Intelligence

2403.0635

Country:

Asia > Singapore (0.04)
Asia > Middle East > Jordan (0.04)
Asia > Indonesia > Bali (0.04)
(37 more...)

Genre:

Instructional Material (1.00)
Research Report > New Finding (0.67)

Industry:

Media (1.00)
Leisure & Entertainment (1.00)
Law (0.92)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Creating Custom Event Data Without Dictionaries: A Bag-of-Tricks

Halterman, Andrew, Schrodt, Philip A., Beger, Andreas, Bagozzi, Benjamin E., Scarborough, Grace I.

arXiv.org Artificial IntelligenceApr-3-2023

Event data, or structured records of ``who did what to whom'' that are automatically extracted from text, is an important source of data for scholars of international politics. The high cost of developing new event datasets, especially using automated systems that rely on hand-built dictionaries, means that most researchers draw on large, pre-existing datasets such as ICEWS rather than developing tailor-made event datasets optimized for their specific research question. This paper describes a ``bag of tricks'' for efficient, custom event data production, drawing on recent advances in natural language processing (NLP) that allow researchers to rapidly produce customized event datasets. The paper introduces techniques for training an event category classifier with active learning, identifying actors and the recipients of actions in text using large language models and standard machine learning classifiers and pretrained ``question-answering'' models from NLP, and resolving mentions of actors to their Wikipedia article to categorize them. We describe how these techniques produced the new POLECAT global event dataset that is intended to replace ICEWS, along with examples of how scholars can quickly produce smaller, custom event datasets. We publish example code and models to implement our new techniques.

category, large language model, machine learning, (22 more...)

arXiv.org Artificial Intelligence

2304.01331

Country:

North America > United States > Kansas (0.04)
Asia > Middle East > Syria > Damascus Governorate > Damascus (0.04)
Europe > Ukraine > Kyiv Oblast > Kyiv (0.04)
(17 more...)

Genre: Research Report > New Finding (0.65)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Government > Regional Government > North America Government > United States Government (1.00)
Government > Military (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback